
    FAT: An In-Memory Accelerator with Fast Addition for Ternary Weight Neural Networks

    Convolutional Neural Networks (CNNs) demonstrate excellent performance in various applications but have high computational complexity. Quantization is applied to reduce the latency and storage cost of CNNs. Among quantization methods, Binary and Ternary Weight Networks (BWNs and TWNs) have a unique advantage over 8-bit and 4-bit quantization: they replace the multiplication operations in CNNs with additions, which are favoured on In-Memory-Computing (IMC) devices. IMC acceleration for BWNs has been widely studied. However, although TWNs have higher accuracy and better sparsity than BWNs, IMC acceleration for TWNs has received limited attention. TWNs on existing IMC devices are inefficient because the sparsity is not well utilized and the addition operation is not efficient. In this paper, we propose FAT, a novel IMC accelerator for TWNs. First, we propose a Sparse Addition Control Unit, which exploits the sparsity of TWNs to skip the null operations on zero weights. Second, we propose a fast addition scheme based on the memory Sense Amplifier that avoids the time overhead of both carry propagation and writing the carry back to memory cells. Third, we propose a Combined-Stationary data mapping to reduce the data movement of activations and weights and to increase parallelism across memory columns. Simulation results show that for addition operations at the Sense Amplifier level, FAT achieves 2.00X speedup, 1.22X power efficiency, and 1.22X area efficiency compared with the state-of-the-art IMC accelerator ParaPIM. On networks with 80% average sparsity, FAT achieves 10.02X speedup and 12.19X energy efficiency compared with ParaPIM.
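    To make the key idea concrete, here is a minimal software sketch (our illustration, not FAT's hardware) of why ternary weights turn multiply-accumulate into sparse addition: every weight is -1, 0, or +1, so each product is an add, a subtract, or a skippable null operation, and the zero-weight skipping is exactly what the Sparse Addition Control Unit exploits in memory.

        # Sketch: a ternary-weight dot product as sparse addition (Python).
        def ternary_dot(activations, weights):
            acc = 0
            for a, w in zip(activations, weights):
                if w == 0:                    # null operation on a zero weight: skip
                    continue
                acc += a if w == 1 else -a    # multiplication degenerates to +/- a
            return acc

        # With 80% sparsity, 80% of the iterations are skipped outright.
        print(ternary_dot([3, -1, 4, 1, 5], [0, 1, 0, -1, 0]))  # -> -1 - 1 = -2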

    XOR-Net: an efficient computation pipeline for binary neural network inference on edge devices

    Accelerating the inference of Convolutional Neural Networks (CNNs) on edge devices is essential due to their small memory size and limited computation capability. Network quantization methods such as XNOR-Net, Bi-Real-Net, and XNOR-Net++ reduce the memory usage of CNNs by binarizing them. They also simplify the multiplication operations to bit-wise operations and obtain good speedup on edge devices. However, there are hidden redundancies in the computation pipeline of these methods, constraining the speedup of these binarized CNNs. In this paper, we propose XOR-Net, an optimized computation pipeline for binary networks both with and without scaling factors. As XNOR is realized by the two instructions XOR and NOT on CPU/GPU platforms, XOR-Net avoids NOT operations by using XOR instead of XNOR, thus reducing the bit-wise operations in both kinds of binary convolution layers. For binary convolution with scaling factors, XOR-Net further rearranges the computation sequence of calculating and multiplying the scaling factors to reduce full-precision operations. Theoretical analysis shows that XOR-Net reduces one-third of the bit-wise operations compared with traditional binary convolution, and up to 40% of the full-precision operations compared with XNOR-Net. Experimental results show that our XOR-Net binary convolution without scaling factors achieves up to 135X speedup and consumes no more than 0.8% of the energy of parallel full-precision convolution. For binary convolution with scaling factors, XOR-Net is up to 17% faster and 19% more energy-efficient than XNOR-Net.

    This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and partially supported by Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087).
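    The saving is easy to see in a bit-packed sketch (ours, assuming the common encoding of +1/-1 as bits 1/0 and a word-wise popcount): both pipelines compute the same dot product, but the XNOR route needs an extra NOT per word on platforms whose ISA offers XOR but no native XNOR.

        # Sketch: bit-packed binary dot product over one 8-bit word (Python).
        # Encoding assumption: +1 -> bit 1, -1 -> bit 0.
        N, MASK = 8, 0xFF
        popcount = lambda x: bin(x).count("1")

        def dot_xnor(a, b):
            # XNOR pipeline: XNOR costs XOR + NOT where no native XNOR exists.
            matches = popcount(~(a ^ b) & MASK)
            return 2 * matches - N

        def dot_xor(a, b):
            # XOR-Net pipeline: count mismatches directly; the NOT disappears.
            return N - 2 * popcount(a ^ b)

        a, b = 0b10110010, 0b10011010
        assert dot_xnor(a, b) == dot_xor(a, b)   # same result, fewer bit-wise ops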

    Person re-identification via pose-aware multi-semantic learning

    Person re-identification (ReID) remains an open research topic with a variety of substantial applications such as tracking and searching. Existing methods mostly explore the highest-semantic feature embedding, ignoring the insights hidden in earlier layers. Moreover, owing to misalignment and pose variations, pose-related information is of great significance and needs to be comprehensively utilized. In this paper, we present a novel person ReID framework called Pose-aware Multi-semantic Fusion Network (PMFN). First, taking multiple semantic levels into account, we propose the Multi-semantic Fusion Network (MFN) as the backbone, employing several shortcuts to reserve bypass feature maps for subsequent fusion. Second, to learn a pose-sensitive embedding, pose-aware clues are incorporated, forming the complete PMFN and investigating well-aligned global and local body regions. Finally, the center loss is introduced to enhance feature discriminability. Exhaustive experiments on two large-scale person ReID benchmarks demonstrate the strengths of our approach over recent state-of-the-art works.

    This work is partially supported by MoE AcRF Tier 2 MOE2019-T2-1-071 and Tier 1 MOE2019-T1-001-072, and NTU NAP M4082282 and SUG M4082087, Singapore.
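    For reference, assuming PMFN adopts the center loss in its standard form (Wen et al.), the term added to the objective is

        L_C = \frac{1}{2} \sum_{i=1}^{m} \left\lVert f(x_i) - c_{y_i} \right\rVert_2^2

    where m is the mini-batch size, f(x_i) the embedding of sample x_i, and c_{y_i} the learned center of its identity y_i; minimizing L_C pulls same-identity features together, which is what "enhancing the feature discriminability" refers to.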

    TAB: unified and optimized ternary, binary and mixed-precision neural network inference on the edge

    Ternary Neural Networks (TNNs) and mixed-precision Ternary Binary Networks (TBNs) have demonstrated higher accuracy than Binary Neural Networks (BNNs) while providing fast, low-power, and memory-efficient inference. Related works have improved the accuracy of TNNs and TBNs but have overlooked their optimization on CPU and GPU platforms. First, there is no unified encoding for the binary and ternary values in TNNs and TBNs. Second, existing works store the 2-bit quantized data sequentially in 32/64-bit integers, resulting in bit-extraction overhead. Last, adopting standard 2-bit multiplications for ternary values leads to a complex computation pipeline, and efficient mixed-precision multiplication between ternary and binary values is unavailable. In this paper, we propose TAB, a unified and optimized inference method for ternary, binary, and mixed-precision neural networks. TAB includes a unified value representation, an efficient data storage scheme, and novel bitwise dot product pipelines on CPU/GPU platforms. We adopt signed integers for consistent value representation across binary and ternary values. We introduce a bitwidth-last data format that stores the first and second bits of the ternary values separately to remove the bit-extraction overhead. We design the ternary and binary bitwise dot product pipelines based on Gated-XOR, using up to 40% fewer operations than state-of-the-art (SOTA) methods. Theoretical speedup analysis shows that our proposed TAB-TNN is 2.3X as fast as the SOTA ternary method RTN, 9.8X as fast as 8-bit integer quantization (INT8), and 39.4X as fast as 32-bit full-precision convolution (FP32). Experimental results on CPU and GPU platforms show that TAB-TNN achieves up to 34.6X speedup and 16X storage size reduction compared with FP32 layers. TBN, Binary-activation Ternary-weight Network (BTN), and BNN in TAB are up to 40.7X, 56.2X, and 72.2X as fast as FP32, respectively. TAB-TNN is up to 70.1% faster and 12.8% more power-efficient than RTN on Darknet-19 while keeping the same accuracy. TAB is open source as a PyTorch extension for easy integration with existing CNN models.

    This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and partially supported by Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087).
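    A plausible reading of the Gated-XOR idea, in a sketch of our own (the encoding below, one non-zero-mask word plus one sign word stored as separate bit planes in the spirit of the bitwidth-last format, is our assumption; the paper's exact pipeline may differ):

        # Hedged sketch: ternary dot product via Gated-XOR (Python).
        # Assumed encoding: per vector, a non-zero mask m and a sign word s
        # (bit set = -1), stored as separate bit planes.
        popcount = lambda x: bin(x).count("1")

        def ternary_dot(m_a, s_a, m_b, s_b):
            nz  = m_a & m_b          # gate: a product is non-zero only where both are
            neg = (s_a ^ s_b) & nz   # XOR of signs, gated by the non-zero mask
            return popcount(nz & ~neg) - popcount(neg)

        # a = [+1, 0, -1, +1], b = [-1, +1, -1, +1] (bit 0 = first element)
        print(ternary_dot(0b1101, 0b0100, 0b1111, 0b0101))  # -> -1 + 0 + 1 + 1 = 1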

    ArSMART: an improved SMART NoC design supporting arbitrary-turn transmission

    SMART NoC, which transmits contention-free flits to distant processing elements (PEs) in one cycle through an express bypass, is a recently proposed high-performance NoC design. However, when contention occurs, low-priority flits are not only buffered but also unable to fully utilize the bypass. Although several routing algorithms exist that decrease contention by routing around busy routers and links, they are not directly applicable to SMART, since it lacks support for arbitrary-turn routing (i.e., routing in which the number and direction of turns are unconstrained). Thus, in this article, to minimize contention and further utilize the bypass, we propose an improved SMART NoC, called ArSMART, which enables arbitrary-turn transmission. Specifically, ArSMART divides the whole NoC into multiple clusters, where route computation is conducted by the cluster controller and data forwarding is performed by the bufferless reconfigurable router. Since long-range transmission in SMART NoC needs to bypass intermediate arbitration, we enable this feature by directly configuring the input and output port connections rather than applying hop-by-hop table-based arbitration. To further explore higher communication capability, we propose effective adaptive routing algorithms compatible with ArSMART. The route computation overhead, one of the main concerns for adaptive routing algorithms, is hidden by our carefully designed control mechanism. Compared with the state-of-the-art SMART NoC, the experimental results demonstrate an average reduction of 40.7% in application schedule length and 29.7% in energy consumption.

    This work is partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087).
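    As a rough illustration of what "directly configuring the input and output port connections" means (a sketch of ours under simplified assumptions, not ArSMART's actual control logic), an arbitrary-turn route can be compiled by a cluster controller into one static bypass setting per intermediate router, with no per-hop arbitration:

        # Hedged sketch: an arbitrary-turn route as static port connections (Python).
        def port(dx, dy):
            return {(1, 0): "E", (-1, 0): "W", (0, 1): "N", (0, -1): "S"}[(dx, dy)]

        def configure(path):
            """path: list of (x, y) router coordinates; turns are unconstrained."""
            config = []
            for prev, cur, nxt in zip(path, path[1:], path[2:]):
                in_p  = port(prev[0] - cur[0], prev[1] - cur[1])
                out_p = port(nxt[0] - cur[0],  nxt[1] - cur[1])
                config.append((cur, in_p, out_p))  # static bypass connection
            return config

        # A three-turn route that single-turn (XY-style) SMART setup cannot express:
        print(configure([(0, 0), (1, 0), (1, 1), (2, 1), (2, 2)]))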

    Contention-aware routing for thermal-reliable optical networks-on-chip

    Optical network-on-chip (ONoC) architectures offer ultrahigh bandwidth, low latency, and low power dissipation for new-generation manycore systems. However, the benefits in communication performance and energy efficiency are diminished by communication contention. Intrinsic thermal susceptibility is another challenge for ONoC designs: under on-chip temperature variations, core functional devices suffer from significant thermal-induced optical power loss, which seriously threatens ONoC reliability. In this article, we develop novel routing techniques that resolve both issues for ONoCs. By analyzing the thermal effect in ONoCs, we first present a routing criterion at the network level; combined with device-level thermal tuning, it can implement thermal-reliable ONoCs. Two routing approaches, a mixed-integer linear programming (MILP) model and a heuristic algorithm (called CAR), are further proposed to minimize communication conflicts under guaranteed thermal reliability while maximizing communication energy efficiency in the presence of on-chip thermal variations. By applying the criterion, our approaches achieve excellent performance with largely reduced design-space-exploration complexity. Evaluation results based on both synthetic traffic patterns and realistic benchmarks validate the effectiveness of our approaches, with an average 126.95% improvement in communication performance and 16.12% reduction in energy overhead compared to state-of-the-art techniques. CAR introduces only a 7.20% performance difference compared to the MILP model and is more scalable to large-size ONoCs.

    This work is partially supported by the National Natural Science Foundation of China (NSFC No. 61772094), and partially supported by the Ministry of Education, Singapore, under its Academic Research Fund Tier 2 (MOE2019-T2-1-071) and Tier 1 (MOE2019-T1-001-072), and Nanyang Technological University, Singapore, under its NAP (M4082282) and SUG (M4082087).
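    The division of labor between the reliability criterion and the contention objective can be pictured with a toy selection step (our simplification, not the paper's MILP formulation or the CAR algorithm): filter candidate paths by the thermal-reliability criterion first, then pick the least-contended survivor.

        # Hedged sketch: reliability-first, contention-aware route selection (Python).
        def links(path):
            # a path is a node sequence; its links are consecutive node pairs
            return list(zip(path, path[1:]))

        def pick_route(candidates, link_load, is_reliable):
            feasible = [p for p in candidates if is_reliable(p)]   # thermal criterion first
            return min(feasible,                                   # then least contention
                       key=lambda p: sum(link_load.get(l, 0) for l in links(p)))

        # toy example: the second candidate has less total load on its links
        load = {("a", "b"): 3, ("a", "c"): 0, ("c", "d"): 1, ("b", "d"): 2}
        print(pick_route([["a", "b", "d"], ["a", "c", "d"]], load, lambda p: True))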

    Fat-Tree-Based Optical Interconnection Networks Under Crosstalk Noise Constraint

    Optical networks-on-chip (ONoCs) have shown the potential to substitute for electronic networks-on-chip (NoCs), bringing substantially higher bandwidth and more efficient power consumption in both on- and off-chip communication. However, the basic optical devices that are the key components in constructing ONoCs experience inevitable crosstalk noise and power loss; the crosstalk noise from these basic devices accumulates in large-scale ONoCs and considerably hurts the signal-to-noise ratio (SNR) as well as restricting network scalability. For the first time, this paper presents a formal system-level analytical approach to analyze the worst-case crosstalk noise and SNR in arbitrary fat-tree-based ONoCs. The analyses are performed hierarchically: first at the basic optical device level, then at the optical router level, and finally at the network level. A general 4x4 optical router model is considered, enabling the proposed method to be adapted to fat-tree-based ONoCs using an arbitrary 4x4 optical router. Utilizing the proposed general router model, the worst-case SNR link candidates in the network are determined. Moreover, we apply the proposed analyses to a case study of fat-tree-based ONoCs using an optical turnaround router (OTAR). Quantitative simulation results indicate low SNR values and scalability constraints in large-scale fat-tree-based ONoCs, due to the high crosstalk noise power and power loss. For instance, in fat-tree-based ONoCs using the OTAR with an injection laser power of 0 dBm, the crosstalk noise power exceeds the signal power when the number of processor cores exceeds 128; at 256 cores, the signal power, crosstalk noise power, and SNR are -17.3 dBm, -11.9 dBm, and -5.5 dB, respectively.
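    To make the quoted figures concrete: with the powers expressed in dBm, the SNR in dB is simply their difference,

        \mathrm{SNR} = P_{\mathrm{signal}} - P_{\mathrm{noise}} = -17.3\,\mathrm{dBm} - (-11.9\,\mathrm{dBm}) \approx -5.4\,\mathrm{dB},

    which matches the reported -5.5 dB up to rounding of the underlying power values.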

    Fault-tolerant routing mechanism in 3D optical network-on-chip based on node reuse

    Three-dimensional Networks-on-Chip (3D NoCs) have become a mature multi-core interconnection architecture in recent years. However, traditional electrical lines have very limited bandwidth and high energy consumption, making photonic interconnection promising for future 3D Optical NoCs (ONoCs). Since existing solutions cannot well guarantee the fault tolerance of 3D ONoCs, in this paper we propose a reliable optical router (OR) structure that sacrifices less redundancy to obtain more restore paths. Moreover, using our fault-tolerant routing algorithm, a restore path can be found inside the disabled OR under the deadlock-free condition, i.e., fault-node reuse. Experimental results show that the proposed approach outperforms previous related works by up to 81.1 percent, and by 33.0 percent on average, in throughput under different synthetic and real traffic patterns. It improves the system's average optical signal-to-noise ratio (OSNR) by up to 26.92 percent, and by 12.57 percent on average, and improves average energy consumption by 0.3 percent to 15.2 percent under different topology types/sizes, failure rates, OR structures, and payload packet sizes.
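    For reference, assuming the conventional definition, the OSNR reported above is the signal-to-noise power ratio on a logarithmic scale:

        \mathrm{OSNR} = 10 \log_{10}\!\left( \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} \right)\ \mathrm{dB}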